{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PCA Basics - Coding Exercises\n",
    "\n",
    "In this notebook you will apply what you learned about PCA in the previous lesson to real stock data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get Returns\n",
    "\n",
    "In the previous lesson we used 2-dimensional randomly correlated data to see how we can use PCA to for dimensionality reduction. In this notebook, we will do the same but for 490-dimensional data of stock returns. We will get the stock returns using Zipline and data from Quotemedia, just as we learned in previous lessons. The function `get_returns(start_date, end_date)` in the `utils` module, gets the data from the Quotemedia data bundle and produces the stock returns for the given `start_date` and `end_date`. You are welcome to take a look at the `utils` module to see how this is done.\n",
    "\n",
    "In the code below, we use `utils.get_returns` funtion to get the returns for stock data between `2011-01-05` and `2016-01-05`. You can change the start and end dates, but if you do, you have to make sure the dates are valid trading dates. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import utils\n",
    "\n",
    "# Get the returns for the fiven start and end date. Both dates must be valid trading dates\n",
    "returns = utils.get_returns(start_date='2011-01-05', end_date='2016-01-05')\n",
    "\n",
    "# Display the first rows of the returns\n",
    "returns.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing the Data\n",
    "\n",
    "As we san see above, the `returns` dataframe, contains the stock returns for 490 assets. Eventhough we can't make 490-dimensional plots, we can plot the data for two assets at a time. This plot willl then show us visually how correlated the stock returns are for a pair of stocks.\n",
    "\n",
    "In the code below, we use the `.plot.scatter(x, y)` method to make a scatter plot of the returns of column `x` and column `y`. The `x` and `y` parameters are both integers and idicate the number of the columns we want to plot. For example, if we want to see how correlated the stock of `AAL` and `AAPL` are, we can choose `x=1` and `y=3`, since we can see from the dataframe above that stock `AAL` corresponds to column number `1`, and stock `AAPL` corresponds to column number `3`. You are encouraged to plot different pairs of stocks to see how correlated they are.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Set the default figure size\n",
    "plt.rcParams['figure.figsize'] = [10.0, 6.0]\n",
    "\n",
    "# Make scatter plot\n",
    "ax = returns.plot.scatter(x = 1, y = 3, grid = True, color = 'white', alpha = 0.5, linewidth = 0)\n",
    "ax.set_facecolor('lightslategray')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Correlation of Returns\n",
    "\n",
    "Apart from visualizing the correlation between stocks as we did above, we can also create a correlation dataframe that gives the correlation between every stock. In the code below, we can accomplish this using the `.corr()` method to calculate the correlation between all the paris of stocks in our `returns` dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the correlation between all stock pairs in the returns dataframe\n",
    "returns.corr(method = 'pearson').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, this is a better way to see how correlated the stock returns are than through visulaization. By looking at the table we can easily spot which pairs of stock have the highest correlation. \n",
    "\n",
    "# TODO:\n",
    "\n",
    "In the code below, make a scatter of equity `A` and equity `XOM`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Set the default figure size\n",
    "plt.rcParams['figure.figsize'] = [10.0, 6.0]\n",
    "\n",
    "# Make scatter plot\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO:\n",
    "\n",
    "In the code below, write a function `get_num_components(df, var_ret)` that takes a dataframe, `df`, and a value for the desired amount of variance you want to retain from the `df` dataframe,`var_ret`. In this case, the parameter `df` should be the `returns` dataframe obtained above. The parameter  `var_ret` must be anumber between 0 and 1. The function should return the number of principal components you need to retain that amount of variance. To do this, use Scikit-Learn's PCA() class and its `.explained_variance_ratio_`. The function should also print the total amount of variance retained. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import resources\n",
    "\n",
    "\n",
    "\n",
    "def get_num_components(df, var_ret):\n",
    "    \n",
    "    # Implement Function\n",
    "    return needed_components\n",
    "        \n",
    "            \n",
    "num_components = get_num_components(returns, 0.9)\n",
    "\n",
    "print('\\nNumber of Principal Components Needed: ', num_components)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO:\n",
    "\n",
    "In the previous section you calculated the number of principal compenents needed to retain a given amount of variance. As you might notice you can greatly reduce the dimensions of the data even if you retain a high level of variance (`var_ret` > 0.9). In the code below, use the number of components needed calculated in the last section, `num_components` to calculate by the percentage of dimensionality reduction. For example, if the original data was 100-dimensional, and the amount of components needed to retian a certain amount of variance is 70, then we are able to reduce the data by 30%. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate the percentage of dimensionality reduction\n",
    "\n",
    "print('We were able to reduce the dimenionality of the data by:', red_per, 'percent')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution\n",
    "\n",
    "[Solution notebook](pca_basics_solution.ipynb)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
